Abstract

An identity between two versions of the Chernoff bound on the probability of a certain large
deviations event is established. This identity has an interpretation in statistical physics, namely,
an isothermal equilibrium of a composite system that consists of multiple subsystems of particles.
Several information-theoretic application examples, where the analysis of this large deviations
probability naturally arises, are then described from the viewpoint of this statistical mechanical
interpretation. This results in several relationships between information theory and statistical
physics, which we hope the reader will find insightful.

Index Terms: Large deviations theory, Chernoff bound, statistical physics, thermal equilibrium,
equipartition, thermodynamics, phase transitions.



1 Introduction 

Relationships between information theory and statistical physics have been extensively recognized 
over the last few decades, and they are drawn from many different aspects. We mention here only 
a few of them. 

One such aspect is characterized by identifying structures of optimization problems pertaining 
to certain information-theoretic settings as being analogous to parallel structures that arise in 
statistical physics, and then borrowing statistical-mechanical insights, as well as powerful analysis 
techniques (like the replica method) from statistical physics to the dual information-theoretic
setting of interest. A very partial list of works along this line includes [T], [13], [H], [18], [19], [20],
[a], [22], [30] (and references therein), [31], [32], [36], [37], [H], [12], [33], [S], and [iS].

Another aspect pertains to the philosophy and the application of the maximum entropy prin- 
ciple, which emerged in statistical mechanics in the nineteenth century and has been advocated 
during the previous century in a wide variety of more general contexts, by Jaynes [E], [16], [17], and
by Shore and Johnson [40], as a general guiding principle to problems in information theory (see,
e.g., [3, Chap. 11] and references therein) and other areas, such as signal processing, in particular,
speech coding (see, e.g., [H]), spectrum estimation (see, e.g., [4]), and others.

Yet another aspect is related to ideas and theories that underlie the notion of 'trading' between
information bits and energy, or heat. In particular, Landauer's erasure principle [23] is argued to
provide a powerful link between information theory and physics and to suggest a physical theory of
information (comprehensive overviews are included in, e.g., [26] and [35]). According to Landauer's
principle, the erasure of every bit of information increases the thermodynamic entropy of the world
by $k\ln 2$, where $k$ is Boltzmann's constant, and so, information is actually physical.

Finally, to shift gears more to the direction of this paper, we should mention the aspect of the 
interface between statistical physics and large deviations theory, a line of research advocated most 
prominently by Ellis [8], [9], and developed also by Oono [34], McAllester [27], and others. The main
theme here revolves around the identification of Chernoff bounds and more general large deviations
rate functions with free energies (along with their related partition functions), thermodynamical
entropies, and the underlying maximum-entropy/equilibrium principle associated with them. In
particular, Ellis' book [8] is devoted largely to the application of large deviations theory to the
statistical physics pertaining to models of ferromagnetic spin arrays, like Ising spin glasses and 
others, in order to explore phase transitions phenomena of spontaneous magnetization (see also 

m)- 

This paper, which is mostly expository in character, lies in the intersection of information theory, 
large deviations theory, and statistical physics. In particular, we establish a simple identity between 
two quantities as they can both be interpreted as the rate function of a certain large deviations 
event that involves multiple distributions of sets of independent random variables (as opposed to 
the usual, single set of i.i.d. random variables). The analysis of this large deviations event is of
a general form that is frequently encountered in numerous applications in information theory (cf.
Section 4). Its informal description is as follows: Let $v_1,\ldots,v_n$ be an arbitrary (deterministic)
sequence whose components take on values in a finite set $\mathcal{V}$, and let $U_1,\ldots,U_n$ be a sequence of
random variables where each component is generated independently according to a distribution
$q(u_i|v_i)$, $i=1,\ldots,n$. For a given function $f$ and a constant $E$, we are interested in the large
deviations analysis (Chernoff bound) of the probability of the event

$$\sum_{i=1}^{n} f(U_i,v_i) \le nE, \tag{1}$$

assuming that the relative frequencies of the various symbols in $(v_1,\ldots,v_n)$ stabilize as $n$ grows
without bound, and assuming that $E$ is sufficiently small to make this a rare event for large $n$.

There are (at least) two ways to derive a Chernoff bound on the probability of this event. The
first is to treat the entire sequence of RV's, $\{f(U_i,v_i)\}$, as a whole, and the second is to partition it
according to the various symbols $\{v_i\}$, i.e., to consider the separate large deviations events of the
partial sums, $\sum_{i:\,v_i=v} f(U_i,v)$, $v\in\mathcal{V}$, for all possible allocations of the total 'budget' $nE$ among
the various $\{v\}$. These two approaches lead to two (seemingly) different expressions of Chernoff
bounds, but since they are both exponentially tight, they must agree.

As will be described and discussed in Section 3, the identity between these two Chernoff bounds
has a natural interpretation in statistical physics: it is viewed as a situation of thermal equilibrium 
(maximum entropy) in a system that consists of several subsystems (which can be of different 
kinds), each of them with many particles. 

As will be shown in Section 4, the above-described problem of large deviations analysis of 
the event (1) is encountered in many applications in information theory, such as rate-distortion
coding, channel capacity, hypothesis testing (signal detection, in particular), and others. The
above-mentioned statistical mechanical interpretation then applies to all of them. Accordingly, Section
4 is devoted to expository descriptions of each of these applications, along with the underlying 
physics that is inspired by the proposed thermal equilibrium interpretation. The reader is assumed 
to have very elementary background in statistical physics. 

The remaining part of this paper is organized as follows. In Section 2, we establish some notation 
conventions. In Section 3, we assert and prove our main result, which is the identity between the 
above-described Chernoff bounds. Finally, in Section 4, we explore the application examples.

2 Notation



Throughout this paper, scalar random variables (RV's) will be denoted by capital letters, like
$U$, $V$, $X$, and $Y$, their sample values will be denoted by the respective lower case letters, and their
alphabets will be denoted by the respective calligraphic letters. A similar convention will apply to
random vectors and their sample values, which will be denoted with the same symbols superscripted
by the dimension. Thus, for example, $X^n$ will denote a random $n$-vector $(X_1,\ldots,X_n)$, and
$x^n=(x_1,\ldots,x_n)$ is a specific vector value in $\mathcal{X}^n$, the $n$-th Cartesian power of $\mathcal{X}$. The notations
$x_i^j$ and $X_i^j$, where $i$ and $j$ are integers and $i\le j$, will designate segments $(x_i,\ldots,x_j)$ and
$(X_i,\ldots,X_j)$, respectively, where for $i=1$, the subscript will be omitted (as above). Sequences
without specifying indices are denoted by $\{\cdot\}$. Sources and channels will be denoted generically by
the letter $P$ or $Q$. Specific letter probabilities corresponding to a source $P$ will be denoted by the
corresponding lower case letter, e.g., $p(v)$ is the probability of a letter $v\in\mathcal{V}$. A similar convention
will be applied to a channel $Q$ and the corresponding transition probabilities, e.g., $q(u|v)$, $u\in\mathcal{U}$,
$v\in\mathcal{V}$. The cardinality of a finite set $\mathcal{A}$ will be denoted by $|\mathcal{A}|$. Information-theoretic quantities,
like entropies and mutual informations, will be denoted following the usual conventions of the
information theory literature.

Notation pertaining to statistical physics will also follow, wherever possible, the customary
conventions. I.e., $k$ will denote Boltzmann's constant ($k = 1.38065\times 10^{-23}$ Joules per Kelvin
degree), $T$ - absolute temperature (in Kelvin degrees), $\beta = 1/(kT)$ - the inverse temperature (in
units of Joule$^{-1}$ or erg$^{-1}$), $E$ - energy, the letter $Z$ will be used to denote partition functions, etc.

3 Main Result 

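Let $\mathcal{U}$ and $\mathcal{V}$ be finite¹ sets and let $f:\mathcal{U}\times\mathcal{V}\to\mathbb{R}$ be a given function. Let
$P=\{p(v),\ v\in\mathcal{V}\}$ be a probability mass function on $\mathcal{V}$ and let $Q=\{q(u|v),\ u\in\mathcal{U},\ v\in\mathcal{V}\}$ be
a matrix of conditional probabilities from $\mathcal{V}$ to $\mathcal{U}$.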



¹The assumption that $\mathcal{U}$ is finite is made mostly for the sake of convenience and simplicity. Most of our results
extend straightforwardly to the case of a continuous alphabet $\mathcal{U}$. The extension to a continuous alphabet $\mathcal{V}$ is
somewhat more subtle, however.






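Next, let us define, for each $v\in\mathcal{V}$, the partition function:

$$Z_v(\beta) = \sum_{u\in\mathcal{U}} q(u|v)\,e^{-\beta f(u,v)}, \qquad \beta\ge 0, \tag{2}$$

and for a given $E_v$ in the range

$$\min_{u\in\mathcal{U}} f(u,v) \le E_v \le \sum_{u\in\mathcal{U}} q(u|v)f(u,v), \tag{3}$$

let

$$S_v(E_v) = \min_{\beta\ge 0}\left[\beta E_v + \ln Z_v(\beta)\right]. \tag{4}$$

Further, for a given constant $E$ in the range

$$\sum_{v\in\mathcal{V}} p(v)\min_{u\in\mathcal{U}} f(u,v) \le E \le \sum_{v\in\mathcal{V}}\sum_{u\in\mathcal{U}} p(v)q(u|v)f(u,v),$$

let

$$S(E) = \min_{\beta\ge 0}\left[\beta E + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right]. \tag{5}$$

Let $\mathcal{H}(E)$ denote the set of all $|\mathcal{V}|$-dimensional vectors $\bar{E}=\{E_v,\ v\in\mathcal{V}\}$, where each component
$E_v$ satisfies (3), and where $\sum_{v\in\mathcal{V}} p(v)E_v\le E$. Our main result in this section is the following:

Theorem 1

$$\max_{\bar{E}\in\mathcal{H}(E)}\ \sum_{v\in\mathcal{V}} p(v)S_v(E_v) = S(E). \tag{6}$$

The expression on the right-hand side is, of course, more convenient to work with, since it
involves minimization w.r.t. one parameter only, as opposed to the left-hand side, where there is a
minimization over $\beta$ for every $v$, as well as a maximization over the $|\mathcal{V}|$-dimensional vector $\bar{E}$.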

While the proof of Theorem 1 below is fairly short, in the Appendix (subsection A.1) we outline
an alternative proof which, although somewhat longer, provides, we believe, some additional insight.
As described briefly in the Introduction, it is based on two different approaches to the analysis of
the rate function, $I(E)$, pertaining to the probability of the event:

$$\sum_{i=1}^{n} f(U_i,v_i) \le nE, \tag{7}$$

where $\{U_i\}$ are RV's taking values in $\mathcal{U}$ and drawn according to $q(u^n|v^n)=\prod_{i=1}^{n} q(u_i|v_i)$, and
$v^n=(v_1,\ldots,v_n)$ is a given deterministic vector whose components are in $\mathcal{V}$, with each $v\in\mathcal{V}$
appearing $n_v$ times ($\sum_{v\in\mathcal{V}} n_v = n$), and the related relative frequency, $n_v/n$, is exactly $p(v)$.

It should be noted that the proof in the Appendix pertains to a slightly different definition of the
set $\mathcal{H}(E)$, where the individual upper bound to each $E_v$ is enlarged to $\max_u f(u,v)$. Thus, $\mathcal{H}(E)$ is
extended to a larger set, which will be denoted by $\mathcal{H}_0(E)$ in the Appendix. But the maximum over
$\mathcal{H}_0(E)$ is always attained within the original set $\mathcal{H}(E)$ (as is actually shown in the proof below).
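As a numerical illustration of the event (7), the following minimal sketch compares the exact probability of the event with the Chernoff bound $e^{nS(E)}$, with $S(E)$ as in (5). It assumes an integer-valued $f$, so that the distribution of the sum can be computed by dynamic programming. The function names and the toy values of `P`, `Q` and `F` are hypothetical and are not taken from the paper.

```python
import math

def exact_probability(v_seq, q, f, threshold):
    """Pr{ sum_i f(U_i, v_i) <= threshold } for integer-valued f, obtained by
    dynamic programming over the distribution of the running sum."""
    dist = {0: 1.0}
    for v in v_seq:
        nxt = {}
        for s, pr in dist.items():
            for qu, fu in zip(q[v], f[v]):
                nxt[s + fu] = nxt.get(s + fu, 0.0) + pr * qu
        dist = nxt
    return sum(pr for s, pr in dist.items() if s <= threshold)

def chernoff_exponent(p, q, f, E, beta_max=60.0, iters=100):
    """S(E) of eq. (5): ternary search over beta in [0, beta_max] of the
    convex function beta*E + sum_v p(v) ln Z_v(beta)."""
    def g(beta):
        return beta * E + sum(
            pv * math.log(sum(qu * math.exp(-beta * fu) for qu, fu in zip(qv, fv)))
            for pv, qv, fv in zip(p, q, f))
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))

# Hypothetical toy instance: |V| = 2, |U| = 3.
P = [0.4, 0.6]                            # p(v)
Q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]    # q(u|v), one row per v
F = [[0, 1, 2], [1, 0, 3]]                # f(u, v), one row per v
```

With $n=50$ and a sequence $v^n$ containing $np(v)$ copies of each symbol, `exact_probability` should not exceed `math.exp(n * chernoff_exponent(P, Q, F, E))`, and the normalized logarithm of the exact probability approaches $S(E)$ as $n$ grows.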



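Proof. Here we prove the identity of Theorem 1 directly, without using large deviations analysis
and Chernoff bounds. We first prove that for every $\bar{E}\in\mathcal{H}(E)$, we have $\sum_{v\in\mathcal{V}} p(v)S_v(E_v)\le S(E)$,
and then, of course,

$$\max_{\bar{E}\in\mathcal{H}(E)}\ \sum_{v\in\mathcal{V}} p(v)S_v(E_v) \le S(E)$$

as well. This follows from the following chain of inequalities:

$$\begin{aligned}
\sum_{v\in\mathcal{V}} p(v)S_v(E_v) &= \sum_{v\in\mathcal{V}} p(v)\cdot\min_{\beta\ge 0}\left[\beta E_v + \ln Z_v(\beta)\right] \\
&= \sum_{v\in\mathcal{V}} \min_{\beta\ge 0}\left[\beta p(v)E_v + p(v)\ln Z_v(\beta)\right] \\
&\le \min_{\beta\ge 0}\left[\beta\sum_{v\in\mathcal{V}} p(v)E_v + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right] \\
&\le \min_{\beta\ge 0}\left[\beta E + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right] \\
&= S(E),
\end{aligned} \tag{8}$$

where the first inequality holds because a sum of minima cannot exceed the minimum of the sum,
and in the second inequality we used the postulate that $\sum_{v\in\mathcal{V}} p(v)E_v\le E$.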

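In the other direction, let $\beta^*$ be the achiever of $S(E)$, i.e., $\beta^*$ is the solution to the equation:

$$E = -\frac{\partial}{\partial\beta}\left[\sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right].$$

For each $v\in\mathcal{V}$, let $E_v^*\in[\min_u f(u,v),\ \sum_u q(u|v)f(u,v)]$ be chosen such that $\beta^*$ would be the
achiever of $S_v(E_v^*)$, i.e., $E_v^* = -[\partial\ln Z_v(\beta)/\partial\beta]_{\beta=\beta^*}$. Obviously, the vector $\{E_v^*,\ v\in\mathcal{V}\}$ lies in
$\mathcal{H}(E)$, and

$$\sum_{v\in\mathcal{V}} p(v)E_v^* = -\sum_{v\in\mathcal{V}} p(v)\left[\frac{\partial\ln Z_v(\beta)}{\partial\beta}\right]_{\beta=\beta^*} = -\frac{\partial}{\partial\beta}\left[\sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right]_{\beta=\beta^*} = E. \tag{9}$$

Thus,

$$\begin{aligned}
\max_{\bar{E}\in\mathcal{H}(E)}\ \sum_{v\in\mathcal{V}} p(v)S_v(E_v) &\ge \sum_{v\in\mathcal{V}} p(v)S_v(E_v^*) \\
&= \beta^*\sum_{v\in\mathcal{V}} p(v)E_v^* + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta^*) \\
&= \beta^*E + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta^*) \\
&= S(E).
\end{aligned} \tag{10}$$

This completes the proof of Theorem 1. □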
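The identity (6) can also be checked numerically. The sketch below does so for a toy instance with $|\mathcal{V}|=2$, under the following assumptions: the minimizations over $\beta$ are carried out by ternary search (valid because the objectives are convex in $\beta$), and the maximization over $\mathcal{H}(E)$ is replaced by a grid search over allocations that spend the budget in full, which loses nothing since each $S_v$ is non-decreasing on the range (3). All function names and numerical values are hypothetical.

```python
import math

def log_Z(beta, q_row, f_row):
    """ln Z_v(beta) of eq. (2) for one symbol v."""
    return math.log(sum(q * math.exp(-beta * f) for q, f in zip(q_row, f_row)))

def min_over_beta(g, beta_max=60.0, iters=100):
    """Minimum of a convex function g over [0, beta_max] by ternary search."""
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))

def S_v(E_v, q_row, f_row):
    """Per-subsystem entropy of eq. (4)."""
    return min_over_beta(lambda b: b * E_v + log_Z(b, q_row, f_row))

def S_total(E, p, q, f):
    """Right-hand side of eq. (6), i.e., S(E) of eq. (5)."""
    return min_over_beta(
        lambda b: b * E + sum(pv * log_Z(b, qv, fv) for pv, qv, fv in zip(p, q, f)))

def best_allocation(E, p, q, f, grid=401):
    """Left-hand side of eq. (6) for |V| = 2: grid search over E_0, with
    E_1 fixed by p0*E_0 + p1*E_1 = E and both E_v kept inside the range (3)."""
    lo = [min(fv) for fv in f]
    hi = [sum(qu * fu for qu, fu in zip(qv, fv)) for qv, fv in zip(q, f)]
    best = -math.inf
    for i in range(grid):
        e0 = lo[0] + (hi[0] - lo[0]) * i / (grid - 1)
        e1 = (E - p[0] * e0) / p[1]
        if lo[1] <= e1 <= hi[1]:
            best = max(best, p[0] * S_v(e0, q[0], f[0]) + p[1] * S_v(e1, q[1], f[1]))
    return best

# Hypothetical toy instance.
P = [0.4, 0.6]
Q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
F = [[0, 1, 2], [1, 0, 3]]
```

For these values, `best_allocation(0.8, P, Q, F)` and `S_total(0.8, P, Q, F)` should agree up to the resolution of the grid, with the former never exceeding the latter beyond numerical error.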

The function $Z_v(\beta)$ is similar to the well-known partition function pertaining to the Boltzmann
distribution w.r.t. the Hamiltonian (energy function) $\mathcal{E}_v(u)=f(u,v)$, except that each exponential
term is weighted by $q(u|v)$, as opposed to the usual form, which is just $\sum_{u\in\mathcal{U}} e^{-\beta\mathcal{E}_v(u)}$. Before
describing the statistical mechanical interpretation of eq. (6), we should note that $Z_v(\beta)$ defined in
(2) can easily be related to the ordinary partition function, without weighting, as follows: Suppose
that $\{q(u|v)\}$ are rational² and hence can be represented as ratios of two positive integers, $q(u|v)=
M(u|v)/M$, where $M\ge|\mathcal{U}|$ is common to all $u\in\mathcal{U}$ (and $v\in\mathcal{V}$). Now, imagine that every value of
$u$ actually represents a 'quantization' of a more refined microstate (call it a "nanostate") $w\in\mathcal{W}$,
$|\mathcal{W}|=M$, so that $u=g_v(w)$, where $g_v$ is a many-to-one function, for which the inverse image of
every $u$ consists of $M(u|v)$ many values of $w$. Suppose further that the Hamiltonian depends on
$w$ only via $g_v(w)$, i.e., $\mathcal{E}_v(w)=\mathcal{E}_v(g_v(w))$. Then, the (ordinary) partition function related to $w$ is
given by

$$\begin{aligned}
\zeta_v(\beta) &= \sum_{w\in\mathcal{W}} e^{-\beta\mathcal{E}_v(w)} \\
&= \sum_{w\in\mathcal{W}} e^{-\beta\mathcal{E}_v(g_v(w))} \\
&= \sum_{u\in\mathcal{U}} M(u|v)\,e^{-\beta\mathcal{E}_v(u)} \\
&= M\sum_{u\in\mathcal{U}} q(u|v)\,e^{-\beta\mathcal{E}_v(u)} = MZ_v(\beta).
\end{aligned} \tag{11}$$

Thus, the weighted partition function is, within a constant factor $M$, the same as the ordinary
partition function of $w$. This factor cancels out when probabilities are calculated, since it appears
both in the numerator and the denominator. Moreover, it affects neither the minimizing $\beta$ that
achieves $S_v(E_v)$ or $S(E)$, nor the derivatives of the log-partition function.

²Even if not rational, they can always be approximated as such to an arbitrarily good precision.
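The relation $\zeta_v(\beta)=MZ_v(\beta)$ of eq. (11) can be illustrated by enumerating nanostates explicitly. The sketch below is only an illustration for a single symbol $v$: it takes rational probabilities as `Fraction` objects, uses their least common denominator as $M$, and sums one Boltzmann term per nanostate. The function names are hypothetical.

```python
import math
from fractions import Fraction

def weighted_partition(beta, q_row, energies):
    """Z_v(beta) = sum_u q(u|v) exp(-beta * E_v(u)), as in eq. (2)."""
    return sum(float(q) * math.exp(-beta * e) for q, e in zip(q_row, energies))

def ordinary_partition(beta, q_row, energies):
    """Returns (M, zeta): M is the number of nanostates and zeta is the
    unweighted sum of exp(-beta * energy) over all nanostates."""
    M = 1
    for q in q_row:
        M = M * q.denominator // math.gcd(M, q.denominator)  # least common multiple
    zeta = 0.0
    for q, e in zip(q_row, energies):
        multiplicity = q.numerator * (M // q.denominator)    # M(u|v)
        for _ in range(multiplicity):                        # one term per nanostate w
            zeta += math.exp(-beta * e)
    return M, zeta
```

For example, with `q_row = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]` the sketch uses $M=6$ nanostates with multiplicities 3, 2 and 1, and the returned `zeta` equals `M * weighted_partition(...)` up to rounding.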

We now move on to our interpretation of eq. (6) from the viewpoint of elementary statistical
physics: Consider a physical system which consists of $|\mathcal{V}|$ subsystems of particles. The total number
of particles in the system is $n$ and the total amount of energy is $nE$ Joules. For each $v\in\mathcal{V}$, the
subsystem indexed by $v$ (subsystem $v$, for short) contains $n_v=np(v)$ particles, each of which can lie
in any microstate within a finite set of microstates $\mathcal{U}$ (or an underlying nanostate in a set $\mathcal{W}$), and
it is characterized by an additive Hamiltonian $\mathcal{E}_v(u_1,\ldots,u_{n_v})=\sum_{i=1}^{n_v} f(u_i,v)$. The total amount of
energy possessed by subsystem $v$ is given by $n_vE_v$ Joules. As long as the subsystems are in thermal
isolation from each other, each one of them may have its own temperature $T_v=1/(k\beta_v)$, where
$\beta_v$ is the achiever of the normalized (per-particle) entropy associated with an average per-particle
energy $E_v$, i.e.,

$$S_v(E_v) = \min_{\beta\ge 0}\left[\beta E_v + \ln Z_v(\beta)\right].$$

The above-mentioned rate function $I(E)$ of $\Pr\{\sum_{i=1}^{n} f(U_i,v_i)\le nE\}$ is then given by the negative
maximum total per-particle entropy, $\sum_v p(v)S_v(E_v)$, where the maximum is over all energy
allocations $\{E_v\}$ such that the total energy is conserved, i.e., $\sum_v p(v)E_v=E$. This maximum is
attained by the expression on the r.h.s. of eq. (6), where there is only one temperature parameter,
and hence it corresponds to thermal equilibrium. In other words, the whole system then lies at the
same temperature $T^*=1/(k\beta^*)$, where $\beta^*$ is the minimizer of $S(E)$. Thus, the energy allocation
among the various subsystems in equilibrium is such that their temperatures are the same (cf.
the above proof of Theorem 1). Theorem 1 is then interpreted as expressing the second law of
thermodynamics.
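The equal-temperature allocation can be computed directly. The following minimal sketch finds the common inverse temperature $\beta^*$ by bisection on the total mean energy, which decreases with $\beta$, and then reads off the per-subsystem energies $E_v^*=-[\partial\ln Z_v(\beta)/\partial\beta]_{\beta=\beta^*}$. The function names and the toy values are hypothetical, and the bisection is just one convenient way to solve the stationarity equation.

```python
import math

def mean_energy(beta, q_row, f_row):
    """-d/dbeta ln Z_v(beta): the mean of f(u, v) under the tilted law
    proportional to q(u|v) exp(-beta f(u, v))."""
    w = [q * math.exp(-beta * f) for q, f in zip(q_row, f_row)]
    return sum(wi * fi for wi, fi in zip(w, f_row)) / sum(w)

def equilibrium(p, q, f, E, beta_max=60.0, iters=200):
    """Bisection for beta* with sum_v p(v) E_v(beta*) = E.
    Returns (beta*, [E_v* for each v])."""
    def total(beta):
        return sum(pv * mean_energy(beta, qv, fv) for pv, qv, fv in zip(p, q, f))
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > E:   # too much energy: the system must be colder
            lo = mid
        else:
            hi = mid
    beta_star = 0.5 * (lo + hi)
    return beta_star, [mean_energy(beta_star, qv, fv) for qv, fv in zip(q, f)]

# Hypothetical toy instance.
P = [0.4, 0.6]
Q = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
F = [[0, 1, 2], [1, 0, 3]]
```

Calling `equilibrium(P, Q, F, 0.8)` returns a single $\beta^*$ together with energies $E_v^*$ that satisfy the conservation constraint $\sum_v p(v)E_v^*=E$, in line with eq. (9).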

At this point, a few comments are in order: 



1. It should be pointed out that in the above physical interpretation, we have implicitly assumed
that the particles within each subsystem are distinguishable, and so the partition function
corresponding to a set of $n_v$ particles is given by the partition function of a single particle
raised to the power of $n_v$, without dividing by $n_v!$. This differs then from the indistinguishable
case only by a constant factor (as long as $n_v$ is indeed constant), and hence the difference
between the distinguishable and the indistinguishable cases is not essential for the most part
of our discussion.



2. As mentioned in the above paragraph, our conclusion is that $I(E)=-S(E)$. At first glance,
this may seem peculiar, as it appears that $I(E)$ may be negative. However, one should keep in
mind that $S(E)$ is induced by a (convex) combination of weighted partition functions, rather
than ordinary partition functions, like $\zeta_v(\beta)$. Referring to eq. (11), the ordinary notion of
entropy $\Sigma(E)$, as the normalized log-number of (nano)states with normalized energy $E$, is
given by

$$\begin{aligned}
\Sigma(E) &= \min_{\beta\ge 0}\left[\beta E + \sum_{v\in\mathcal{V}} p(v)\ln\zeta_v(\beta)\right] \\
&= \min_{\beta\ge 0}\left[\beta E + \sum_{v\in\mathcal{V}} p(v)\ln Z_v(\beta)\right] + \ln M \\
&= S(E) + \ln M.
\end{aligned} \tag{12}$$

Thus,

$$I(E) = \ln M - \Sigma(E),$$

which is always non-negative.



3. The identity (6) can be thought of as a generalized concavity property of the entropy: Had all
the entropy functions $S_v(\cdot)$ been the same, this would have been the usual concavity property.
What makes this equality less trivial and more interesting is that it continues to hold even
when $S_v(\cdot)$, for the various $v\in\mathcal{V}$, are different from each other.

4. On the more technical level, since this paper draws analogies with physics, we should say a few
words about physical units. The products $\beta E$, $\beta E_v$, $\beta f(u,v)$, etc., should all be pure numbers,
of course. Since $\beta=1/(kT)$, where $k$ is Boltzmann's constant and $T$ is absolute temperature,
and since $kT$ has units of energy (Joules or ergs, etc.), it is understood that $E$, $E_v$, $f(u,v)$ and
the like, should all have units of energy as well. In the applications described below, whenever
this is not the case, i.e., the latter quantities are pure numbers rather than physical energies,
we will sometimes reparametrize $\beta$ by $\beta\epsilon_0$, where $\epsilon_0$ is an arbitrary constant possessing units
of energy (e.g., $\epsilon_0$ = 1 Joule or $\epsilon_0$ = 1 erg), and we absorb $\epsilon_0$ in the Hamiltonian, i.e., redefine
$\mathcal{E}_v(u)=\epsilon_0f(u,v)$. Thus, in this case, $S_v(E)$, where $E$ is now the energy in units of $\epsilon_0$, is
redefined as

$$S_v(E) = \min_{\beta\ge 0}\left[\beta\epsilon_0E + \ln\sum_{u\in\mathcal{U}} q(u|v)\,e^{-\beta\epsilon_0f(u,v)}\right].$$

This kind of modification is not essential, but it may help to avoid confusion about units
when the picture is viewed from the aspects of physics.

4 Applications 

Equipped with the main result of the previous section and its statistical mechanical interpretation, 
we next introduce a few applications that fall within the framework considered. In all these
applications, there is an underlying large deviations event of the type of eq. (7), whose rate function
is of interest. The above-described viewpoint of statistical physics is then relevant in all these
applications.

4.1 The Rate-Distortion Function

Let $P=\{p(x),\ x\in\mathcal{X}\}$ designate the vector of letter probabilities associated with a given discrete
memoryless source (DMS), and for a given reproduction alphabet $\hat{\mathcal{X}}$, let $d:\mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}^+$ denote
a single-letter distortion measure. Let $R(D)$ denote the rate-distortion function of the DMS $P$.

One useful way to think of the rate-distortion function is inspired by the classical random
coding argument: Let $(\hat{X}_1,\ldots,\hat{X}_n)$ be drawn i.i.d. from the optimum random coding distribution
$q^*(\hat{x}_1,\ldots,\hat{x}_n)=\prod_{i=1}^{n} q^*(\hat{x}_i)$, and consider the event $\sum_{i=1}^{n} d(x_i,\hat{X}_i)\le nD$, where $x^n$ is a given source
vector, typical to $P$, i.e., the composition of $x^n$ consists of $n_x=np(x)$ occurrences of each $x\in\mathcal{X}$.
This is exactly an event of the type (7) with $U_i=\hat{X}_i$, $v_i=x_i$, $i=1,\ldots,n$, $q(u|v)=q(\hat{x}|x)=q^*(\hat{x})$
independently of $x$, $f(u,v)=f(\hat{x},x)=d(x,\hat{x})$, and $E=D$. I.e., the Hamiltonian $\mathcal{E}_x(\hat{x})$ is given
by $\epsilon_0d(x,\hat{x})$ and the total energy is $nD$ in units of $\epsilon_0$.

Suppose that this probability is of the exponential order of $e^{-nI(D)}$. Then, it takes about
$M=e^{n[I(D)+\epsilon]}$ ($\epsilon>0$, however small) independent trials to 'succeed' at least once (with high
probability) in having some realization of $\hat{X}^n$ within distance $nD$ from $x^n$. This is the well-known
classical random coding achievability argument that leads to $I(D)=R(D)$. Thus, the
large-deviations rate function of interest agrees exactly with the rate-distortion function (cf. [3, Sect.
3.4]), which is:

$$R(D) = -\min_{\beta\ge 0}\left[\beta\epsilon_0D + \sum_{x\in\mathcal{X}} p(x)\ln\sum_{\hat{x}\in\hat{\mathcal{X}}} q^*(\hat{x})\,e^{-\beta\epsilon_0d(x,\hat{x})}\right]. \tag{13}$$

Interestingly, in [10, p. 90, Corollary 4.2.3], the rate-distortion function is shown, using completely
different considerations, to have a parametric representation which can be written exactly in this
form.

The fact that the rate-distortion function has an interpretation of an isothermal equilibrium
situation in statistical thermodynamics is not quite new (cf., e.g., [3, Sect. 6.4], [38]). Here, however,
we obtain it in a more explicit manner and as a special case of a more general principle.

A simple example is that of the binary symmetric source with the Hamming distortion measure.
It is easy to see that, in this example, the relationship between distortion and temperature is:

$$T = \frac{\epsilon_0}{k\ln[(1-D)/D]} \quad\text{or, equivalently,}\quad D = \frac{1}{1+e^{\epsilon_0/(kT)}}, \tag{14}$$

and, of course, $R(D)=1-h_2(D)$, where $h_2(D)$ is the binary entropy function.
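The binary symmetric example can be verified with a few lines of code. The sketch below implements the two forms of eq. (14) and evaluates the bracketed expression of eq. (13) at the stationary point for a uniform $q^*$. It works in nats, so the result should match $\ln 2-h(D)$, which is $1-h_2(D)$ expressed in nats rather than bits. The default units $\epsilon_0=k=1$ and the function names are assumptions made for the illustration.

```python
import math

def distortion_from_temperature(T, eps0=1.0, k=1.0):
    """Second form of eq. (14): D = 1 / (1 + exp(eps0 / (k T)))."""
    return 1.0 / (1.0 + math.exp(eps0 / (k * T)))

def temperature_from_distortion(D, eps0=1.0, k=1.0):
    """First form of eq. (14): T = eps0 / (k ln[(1 - D) / D]), for 0 < D < 1/2."""
    return eps0 / (k * math.log((1.0 - D) / D))

def rate_nats(D, eps0=1.0):
    """Eq. (13) for the binary symmetric source with Hamming distortion and
    uniform q*, evaluated at the stationary beta (where beta*eps0 = ln[(1-D)/D])."""
    beta = math.log((1.0 - D) / D) / eps0
    log_Z = math.log(0.5 * (1.0 + math.exp(-beta * eps0)))
    return -(beta * eps0 * D + log_Z)
```

Since the source is symmetric, the partition function is the same for both source letters, so the average over $p(x)$ in eq. (13) reduces to a single term.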

A slightly more involved example pertains to the regime of high resolution (small distortion) and
it turns out to be related to (a generalized version of) the law of equipartition of energy in statistical
physics: Consider the $L_\theta$ distortion measure, $d(x,\hat{x})=|x-\hat{x}|^\theta$ (most commonly encountered are
the cases $\theta=1$ and $\theta=2$). Let us assume that $D>0$ is very small and consider the (continuous)
uniform random coding distribution $q(\hat{x})=1/(2A)$ in the interval $[-A,A]$ and zero elsewhere. This
random coding distribution is suboptimal, but it corresponds to, and hence is well motivated by,
many results in high-resolution quantization using uniform quantizers (see, e.g., [12] and references
therein). For every $x\in\mathcal{X}$, the partition function is given by

$$Z_x(\beta) = \frac{1}{2A}\int_{-A}^{A}\exp\{-\beta\epsilon_0|x-\hat{x}|^\theta\}\,d\hat{x}.$$

When $D$ is very small, $\beta$ is very large, and then the finite-interval integral pertaining to $Z_x(\beta)$
can be approximated³ by an infinite one, provided that the support of $\{p(x)\}$ is included⁴ in the
interval $[-A,A]$:

$$Z_x(\beta) \approx \frac{1}{2A}\int_{-\infty}^{\infty}\exp\{-\beta\epsilon_0|x-\hat{x}|^\theta\}\,d\hat{x}, \tag{15}$$

which then becomes independent of $x$. The average distortion (internal energy) associated with
this partition function can be evaluated using the same technique as the one that leads to the law
of equipartition in statistical physics:

$$\begin{aligned}
\epsilon_0D &= -\frac{\partial}{\partial\beta}\ln\left[\int_{-\infty}^{\infty}\exp\{-\beta\epsilon_0|x-\hat{x}|^\theta\}\,d\hat{x}\right] \\
&= -\frac{\partial}{\partial\beta}\ln\left[\beta^{-1/\theta}\int_{-\infty}^{\infty}\exp\{-\epsilon_0|\beta^{1/\theta}(x-\hat{x})|^\theta\}\,d(\beta^{1/\theta}(x-\hat{x}))\right] \\
&= -\frac{\partial}{\partial\beta}\ln\left[\beta^{-1/\theta}\int_{-\infty}^{\infty}\exp\{-\epsilon_0|z|^\theta\}\,dz\right] \\
&= \frac{1}{\theta\beta} = \frac{kT}{\theta}.
\end{aligned} \tag{16}$$

[Note that for $\theta=2$, where the Hamiltonian is quadratic in the integration variable $\hat{x}$, this is exactly
the law of equipartition.] Thus, for low temperatures, the distortion is given by $D=kT/(\epsilon_0\theta)$, i.e.,
distortion is linear in temperature in that regime, and the constant of proportionality is related to
the heat capacity, $C=k/\theta$. Since the temperature is proportional to the negative local slope of
the distortion-rate function (as the reciprocal, $\beta$, is proportional to the negative local slope of the
rate-distortion function), this means that the distortion is proportional to its derivative w.r.t. $R$,
which means an exponential relationship of the form $D=D_0e^{-\theta R}$ ($D_0$ being a constant). For $\theta=2$ (mean
square error), this is recognized as the well-known characterization of distortion as a function of rate
in the high resolution regime. Specifically, in this case, the factor of 2 at the denominator of $kT/2$,
the universal expression of the internal energy per degree of freedom according to the equipartition
theorem, has the same origin as the factor of 2 that appears in the exponent of $D(R)=D_0e^{-2R}$
(decay of 6dB per bit). Thus, the law of equipartition in statistical physics is related to the behavior
of rate-distortion codes in the high resolution regime.

³See the Appendix (subsection A.2) for a more rigorous derivation.

⁴An alternative, softer condition is that the probability that $|x|>A$ is negligibly small.
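The generalized equipartition relation $\epsilon_0D=1/(\theta\beta)$ of eq. (16) lends itself to a direct numerical check. The sketch below estimates the mean of $|z|^\theta$ under a density proportional to $\exp\{-\beta\epsilon_0|z|^\theta\}$ with a midpoint rule, truncating the integral where the exponent reaches a hypothetical cutoff of 40. The function name, the cutoff and the grid size are assumptions made for the illustration.

```python
import math

def average_distortion(beta_eps0, theta, n=20000):
    """Mean of |z|^theta under the density proportional to
    exp(-beta_eps0 * |z|^theta), by the midpoint rule on [-L, L], where L is
    chosen so that the integrand has decayed by a factor exp(-40) at z = L."""
    L = (40.0 / beta_eps0) ** (1.0 / theta)
    h = 2.0 * L / n
    num = 0.0
    den = 0.0
    for i in range(n):
        z = -L + (i + 0.5) * h
        a = abs(z) ** theta
        w = math.exp(-beta_eps0 * a)
        num += a * w
        den += w
    return num / den
```

For a given value of $\beta\epsilon_0$, the result should be close to $1/(\theta\beta\epsilon_0)$ for every $\theta$, which for $\theta=2$ is the familiar $kT/2$ per degree of freedom once the distortion is converted to energy units.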

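To compute the rate associated with this temperature more explicitly, note that the minimizing
$\beta^*$ is given by $1/(\theta\epsilon_0D)$, and so

$$\begin{aligned}
R &= -\beta^*\epsilon_0D - \ln\left[\frac{1}{2A}\int_{-\infty}^{\infty}\exp\{-\beta^*\epsilon_0|x-\hat{x}|^\theta\}\,d\hat{x}\right] \\
&= -\frac{1}{\theta} - \ln\left[\frac{1}{2A}\cdot\frac{2\Gamma(1/\theta)(\theta D)^{1/\theta}}{\theta}\right] \\
&= \ln\left[\frac{A\theta}{\Gamma(1/\theta)(e\theta D)^{1/\theta}}\right] \\
&= \ln\left[\frac{A\theta}{\Gamma(1/\theta)}\right] - \frac{1}{\theta}\ln(e\theta D).
\end{aligned} \tag{17}$$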
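The closed form of eq. (17) can be cross-checked against its first line by using the standard integral $\int_{-\infty}^{\infty}e^{-a|z|^\theta}dz=2\Gamma(1/\theta)/(\theta a^{1/\theta})$. The sketch below does this in nats; the function names are hypothetical.

```python
import math

def high_resolution_rate(D, A, theta):
    """Last line of eq. (17): ln[A*theta / Gamma(1/theta)] - (1/theta) ln(e*theta*D)."""
    return (math.log(A * theta / math.gamma(1.0 / theta))
            - math.log(math.e * theta * D) / theta)

def rate_from_partition_function(D, A, theta):
    """First line of eq. (17), with the infinite integral in closed form and
    beta* * eps0 = 1 / (theta * D)."""
    beta_eps0 = 1.0 / (theta * D)
    integral = 2.0 * math.gamma(1.0 / theta) / (theta * beta_eps0 ** (1.0 / theta))
    return -beta_eps0 * D - math.log(integral / (2.0 * A))
```

For $\theta=2$ the expression reduces to $\ln(2A)-\frac{1}{2}\ln(2\pi eD)$, the difference between the differential entropy of the uniform density on $[-A,A]$ and that of a Gaussian with variance $D$, and dividing $D$ by 4 raises the rate by $\ln 2$ nats, i.e., one bit per 6 dB.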



4.2 Channel Capacity 



In complete duality to the random coding argument that puts the rate-distortion function in the
framework discussed in Section 3, a parallel argument can be made with regard to channel capacity.

Given a discrete memoryless channel (DMC) with a finite input alphabet $\mathcal{X}$ and a finite
output alphabet $\mathcal{Y}$, we can obtain capacity using the following argument. Let $\{q^*(x),\ x\in\mathcal{X}\}$ be
the optimum random coding distribution according to which each codeword $X^n$ is drawn
independently. Let $y^n$ be a given channel output sequence which is typical to the output distribution
$p(y)=\sum_{x\in\mathcal{X}} q^*(x)W(y|x)$, where $\{W(y|x),\ x\in\mathcal{X},\ y\in\mathcal{Y}\}$ are the channel transition probabilities.
That is, each symbol $y$ appears $n_y=np(y)$ times in $y^n$. Consider now the large deviations event

$$\sum_{i=1}^{n}\left[-\log W(y_i|X_i)\right] \le nH(Y|X), \tag{18}$$

where $H(Y|X)=-\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} q^*(x)W(y|x)\log W(y|x)$. By the union bound, as long as the
number of randomly chosen codewords is exponentially less than $e^{nI}$, where $I$ is the rate function
of the large-deviations event (18), then the average error probability still vanishes as $n\to\infty$.⁵
Since this is exactly the achievability argument of the channel coding theorem, then $I=C$,
where $C$ is the channel capacity.

⁵Here we apply the union bound to a threshold decoder that seeks a unique codeword that satisfies (18), which,
although suboptimum, is still good enough to achieve capacity.

Again, this complies with our model setting with the assignments $U_i=X_i$, $v_i=y_i$, $i=1,\ldots,n$,
$q(u|v)=q(x|y)=q^*(x)$ independently of $y$, $f(u,v)=f(x,y)=-\log W(y|x)$, and $E=H(Y|X)$ in
units of $\epsilon_0$. In other words, channel capacity can be represented as

$$C = -\min_{\beta\ge 0}\left[\beta\epsilon_0H(Y|X) + \sum_{y\in\mathcal{Y}} p(y)\ln\sum_{x\in\mathcal{X}} q^*(x)\,e^{-\beta\epsilon_0[-\log W(y|x)]}\right]. \tag{19}$$

It is easy to see that, in this case, the equilibrium temperature always corresponds to $\beta\epsilon_0=1$,
namely, $T=\epsilon_0/k$.
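The claim that the minimum in eq. (19) sits at $\beta\epsilon_0=1$ can be checked numerically. In the sketch below, natural logarithms are used throughout, $s$ stands for $\beta\epsilon_0$, and the input distribution `Q_IN` and the channel matrix `W` are hypothetical; in particular, `Q_IN` is not claimed to be capacity-achieving, in which case the negative minimum equals the mutual information $I(X;Y)$ induced by that input rather than the capacity itself.

```python
import math

def capacity_objective(s, q, W):
    """Bracketed expression of eq. (19) with s = beta*eps0 and natural logs:
    s*H(Y|X) + sum_y p(y) ln sum_x q(x) W(y|x)^s."""
    nx, ny = len(q), len(W[0])
    H_cond = -sum(q[x] * W[x][y] * math.log(W[x][y])
                  for x in range(nx) for y in range(ny))
    p_y = [sum(q[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    return s * H_cond + sum(
        p_y[y] * math.log(sum(q[x] * W[x][y] ** s for x in range(nx)))
        for y in range(ny))

def mutual_information(q, W):
    """I(X;Y) in nats for input distribution q and channel W (rows indexed by x)."""
    nx, ny = len(q), len(W[0])
    p_y = [sum(q[x] * W[x][y] for x in range(nx)) for y in range(ny)]
    return sum(q[x] * W[x][y] * math.log(W[x][y] / p_y[y])
               for x in range(nx) for y in range(ny))

def argmin_convex(g, lo=0.0, hi=10.0, iters=100):
    """Minimizer of a convex function g on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# Hypothetical binary channel with strictly positive transition probabilities.
Q_IN = [0.5, 0.5]
W = [[0.9, 0.1], [0.2, 0.8]]
```

At $s=1$ the inner sum collapses to $p(y)$, so the objective equals $H(Y|X)-H(Y)$, which is why its negative is the mutual information.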

By the same token, one can derive an expression of the random coding capacity pertaining to 
mismatched decoding, where the decoder uses an additive metric $m(x,y)$ other than the optimum
metric, $-\log W(y|x)$ (see, e.g., [2], [7], [24], [25], [28], and references therein). The only
modifications to the above expression would be to replace the Hamiltonian by $\mathcal{E}_y(x)=\epsilon_0m(x,y)$ and to
replace $H(Y|X)$ by the expectation of $m(X,Y)$ w.r.t. $q^*(x)W(y|x)$. The new optimum random
coding distribution might change as well. Here, it is no longer necessarily true that the equilibrium
temperature is $T=\epsilon_0/k$.

4.3 Signal Detection and Hypothesis Testing 

Consider the following binary hypothesis testing problem: Given a deterministic signal, which is
represented by a sequence $x^n=(x_1,\ldots,x_n)$ with elements taking on values in a (finite) set $\mathcal{X}$
and relative frequencies $\{p(x),\ x\in\mathcal{X}\}$, and given an observation sequence $Y^n=(Y_1,\ldots,Y_n)$, we
are required to decide between two hypotheses:

$H_0$: The observation vector $Y^n$ is "pure noise," distributed according to some product measure
$Q=\{q(y),\ y\in\mathcal{Y}\}$, i.e., $q(y^n)=\prod_{i=1}^{n} q(y_i)$, which is unrelated to $x^n$.

$H_1$: The observation vector $Y^n$ is a "noisy version" of $x^n$, distributed according to
$q(y^n|x^n)=\prod_{i=1}^{n} q(y_i|x_i)$.

The optimum detector (under both the Bayesian and the Neyman-Pearson criterion) compares
the likelihood ratio $\sum_{i=1}^{n}\ln[q(y_i)/q(y_i|x_i)]$ to a threshold $nE_0$, and decides in favor of $H_0$ if this
threshold is exceeded; otherwise, it decides in favor of $H_1$.

The false-alarm probability then is the probability of the event

$$\sum_{i=1}^{n}\ln\left[\frac{q(Y_i)}{q(Y_i|x_i)}\right] \le nE_0$$

under $Q$. This, again, fits our scenario with the substitutions $U_i=Y_i$, $v_i=x_i$, $i=1,\ldots,n$,
$q(u|v)=q(y)$, independently of $x=v$, $f(u,v)=f(y,x)=\ln[q(y)/q(y|x)]$, and $E=E_0$. Similarly,
the analysis of the missed-detection probability corresponds to the assignments: $U_i=Y_i$ and
$v_i=x_i$, $i=1,\ldots,n$, as before, but now $q(u|v)=q(y|x)$, $f(u,v)=f(y,x)=\ln[q(y|x)/q(y)]$, and
$E=-E_0$. Note that when $\{q(y)\}$ is the uniform distribution over $\mathcal{Y}$, the missed-detection probability
can also be interpreted as the probability of excess code-length of an arithmetic lossless source code
w.r.t. $\{q(y|x)\}$.

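Another situation of hypothesis testing that is related to our study in a similar manner is one
where the signal is always underlying the observations, but the decision to be made is associated
with two hypotheses regarding the noise level, or the temperature. In this case, there is a certain
Hamiltonian $\mathcal{E}_x(y)$ for each $x\in\mathcal{X}$, and we assume a Boltzmann-Gibbs distribution parametrized
by the temperature

$$q(y|x,\beta) = \frac{e^{-\beta\mathcal{E}_x(y)}}{\zeta_x(\beta)},$$

where

$$\zeta_x(\beta) = \sum_{y\in\mathcal{Y}} e^{-\beta\mathcal{E}_x(y)}.$$

Note that here $\zeta_x(\beta)$ is an ordinary partition function, without weighting. We shall also
denote

$$\Sigma(E) = \min_{\beta\ge 0}\left[\beta E + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta)\right].$$

As $\Sigma(E)$ is induced by a convex combination of non-weighted partition functions, it has the
significance of the normalized logarithm of the number of microstates with energy about $nE$. Thus,
$k\cdot\Sigma(E)$, where $k$ is Boltzmann's constant, is the thermodynamic entropy.

Given two values $\beta_1$ and $\beta_2$ (say, $\beta_1>\beta_2$), the hypotheses now are the following:

$H_1$: $Y^n$ is distributed according to $q_1(y^n|x^n)=\prod_{i=1}^{n} q(y_i|x_i,\beta_1)$.

$H_2$: $Y^n$ is distributed according to $q_2(y^n|x^n)=\prod_{i=1}^{n} q(y_i|x_i,\beta_2)$.

The likelihood ratio test compares $\sum_{i=1}^{n}\mathcal{E}_{x_i}(Y_i)$ to a threshold, $nE_0$, and decides in favor of $H_2$
if the threshold is exceeded; otherwise, it favors $H_1$. Here, $E_0$ should lie in the interval $(E_1,E_2)$,
where

$$E_i = -\frac{\partial}{\partial\beta}\left[\sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta)\right]_{\beta=\beta_i}, \qquad i=1,2.$$

For convenience, let us assume now that $E_i$, $i=0,1,2$, and $\mathcal{E}_x(y)$ already have units of energy, so
there is no need to have the constant $\epsilon_0$. In this situation, the exponent of the error probability
under $H_2$ is given by $-S(E_0)$, where

$$\begin{aligned}
S(E_0) &= \min_{\beta\ge 0}\left[\beta E_0 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta+\beta_2) - \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta_2)\right] \\
&= \min_{\beta\ge 0}\left[(\beta+\beta_2)E_0 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta+\beta_2)\right] - \beta_2E_0 - \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta_2) \\
&= \min_{\beta\ge\beta_2}\left[\beta E_0 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta)\right] - \left[\beta_2E_2 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta_2)\right] + \beta_2(E_2-E_0) \\
&= \min_{\beta\ge 0}\left[\beta E_0 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta)\right] - \min_{\beta\ge 0}\left[\beta E_2 + \sum_{x\in\mathcal{X}} p(x)\ln\zeta_x(\beta)\right] + \beta_2(E_2-E_0) \\
&= \Sigma(E_0) - \Sigma(E_2) + \beta_2(E_2-E_0),
\end{aligned} \tag{20}$$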

where we have used the fact that the achiever (3{E) of Ti{E) is a monotonically non-increasing 
function of E, thus, -Eq < -£'2 implies I3{Eq) > P{E2) = P2, and so, the global minimum over /3 > 
is attained for (3 > P2 anyway. 
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It then follows that the error exponent I2 under H2 is given by 



h 



nE2)-nEo)-p2{E2-Eo) 



E2 — Eq 



T2 



^Jeo 



T{E) T2 



dE 



1 /'^^ / I 1 

' - - - ) C{T)dT, 



k.lTo \T T2 



(21) 



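where $T(E) = 1/(k\beta(E))$ is the temperature corresponding to energy $E$, $T_i = T(E_i)$, $i = 0,1,2$,
and $C(T) = \mathrm{d}E/\mathrm{d}T$ is the average heat capacity per particle of the system, which is the weighted
average of heat capacities of all subsystems, i.e.,

$$C(T) = \sum_{x\in\mathcal{X}} p(x)C_x(T),$$

where

$$C_x(T) = \frac{\mathrm{d}E_x}{\mathrm{d}T} = \frac{1}{kT^2}\cdot\frac{\partial^2\ln\zeta_x(\beta)}{\partial\beta^2}\bigg|_{\beta=1/(kT)}.$$

Thus,

$$I_2 = \sum_{x\in\mathcal{X}} p(x)\cdot\frac{1}{k}\int_{T_0}^{T_2}\Big(\frac{1}{T} - \frac{1}{T_2}\Big)C_x(T)\,\mathrm{d}T,$$

which is interpreted as the weighted average of the relative contributions of all subsystems, all of
which are at the same temperature $T_0$.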
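The equality between the entropy form and the heat-capacity form of $I_2$ can be checked numerically. The following is a minimal sketch under stated assumptions: a single hypothetical subsystem consisting of a two-level particle with energies 0 and 1, units in which $k = 1$, and illustrative function names that do not come from the paper.

```python
import math

# Hypothetical single-subsystem example: a two-level particle with energies 0 and 1.
# Units are chosen so that Boltzmann's constant k = 1 (an assumption of this sketch).

def log_zeta(beta):
    """ln of the partition function zeta(beta) = 1 + exp(-beta)."""
    return math.log(1.0 + math.exp(-beta))

def mean_energy(beta):
    """Average energy per particle, E = -d ln zeta / d beta."""
    return math.exp(-beta) / (1.0 + math.exp(-beta))

def entropy_at(beta):
    """Sigma(E) at E = mean_energy(beta), where this beta achieves the minimum."""
    return beta * mean_energy(beta) + log_zeta(beta)

def heat_capacity(temp):
    """C(T) = dE/dT = Var(energy) / T^2 for the two-level particle (k = 1)."""
    e = mean_energy(1.0 / temp)
    return e * (1.0 - e) / temp ** 2

def i2_from_entropy(beta0, beta2):
    """I_2 = Sigma(E_2) - Sigma(E_0) - beta_2 (E_2 - E_0)."""
    e0, e2 = mean_energy(beta0), mean_energy(beta2)
    return entropy_at(beta2) - entropy_at(beta0) - beta2 * (e2 - e0)

def i2_from_heating(beta0, beta2, steps=20000):
    """Midpoint-rule value of int_{T0}^{T2} (1/T - 1/T2) C(T) dT with k = 1."""
    t0, t2 = 1.0 / beta0, 1.0 / beta2
    h = (t2 - t0) / steps
    total = 0.0
    for j in range(steps):
        t = t0 + (j + 0.5) * h
        total += (1.0 / t - 1.0 / t2) * heat_capacity(t)
    return total * h

if __name__ == "__main__":
    # Threshold temperature T0 = 1 and hypothesis temperature T2 = 2.
    print(i2_from_entropy(1.0, 0.5), i2_from_heating(1.0, 0.5))
```

For this toy system, the two values agree up to the discretization error of the midpoint rule, and both vanish when $T_0 = T_2$.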

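In a similar manner, the rate function $I_1$ of the probability of error under $H_1$ is given by:

$$
\begin{aligned}
I_1 &= \Sigma(E_1) - \Sigma(E_0) - \beta_1(E_1-E_0) \\
&= \frac{1}{k}\Big[k\Sigma(E_1) - k\Sigma(E_0) - \frac{E_1-E_0}{T_1}\Big] \\
&= \frac{1}{k}\int_{E_1}^{E_0}\Big[\frac{1}{T_1} - \frac{1}{T(E)}\Big]\,\mathrm{d}E.
\end{aligned}
\tag{22}
$$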



The expression in the square brackets of the second line pertaining to $I_2$ has a simple graphical
interpretation (see Fig. 1): It is the vertical distance (corresponding to the vertical line $E = E_0$)
between the curve $\Sigma(E)$ and the line tangent to that curve at $E = E_2$ (whose slope is $\beta_2 = \beta(E_2)$).






The two other expressions of $I_2$, in the last chain of equalities, describe the error exponent $I_2$ in
terms of slow heating from temperature $T_0$ to temperature $T_2$. Similar comments apply to $I_1$ (cf.
Fig. 1). Thus, the error exponents are linear functionals of the average heat capacity, $C(T)$, in the
range of temperatures $[T_1, T_2]$. The higher is the heat capacity, the better is the discrimination
between the hypotheses. This is related to the fact that Fisher information of the parameter $\beta$ is
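given by

$$J(\beta) = \sum_{x\in\mathcal{X}} p(x)\,\frac{\partial^2\ln\zeta_x(\beta)}{\partial\beta^2} = kT^2C(T),$$

namely, again, a linear function of $C(T)$. However, while the Fisher information depends only on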
one local value of $C(T)$ (as it measures the sensitivity of the likelihood function to the parameter
in a local manner), the error exponents depend on $\{C(T): T_1 \le T \le T_2\}$ in a cumulative manner,
via the above integrals. The tradeoff between $I_1$ and $I_2$ is also obvious: by enlarging the threshold
$E_0$, or, correspondingly, $T_0$, the range of integration pertaining to $I_1$ increases at the expense of
the one of $I_2$ and vice versa. In the extreme case, where $I_2 = 0$, we get

$$I_1 = D(P_2\|P_1) = \frac{1}{k}\int_{T_1}^{T_2}\Big(\frac{1}{T_1} - \frac{1}{T}\Big)C(T)\,\mathrm{d}T.$$
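As a sanity check on the extreme case $I_2 = 0$, the sketch below compares the divergence between two Boltzmann distributions with the heat-capacity integral from $T_1$ to $T_2$. It also checks the standard exponential-family fact that the Fisher information of $\beta$ is the variance of the energy, which equals $kT^2C(T)$. The two-level particle, the choice $k = 1$, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical two-level particle (energies 0 and 1), in units where k = 1.
ENERGIES = (0.0, 1.0)

def boltzmann(beta):
    """Boltzmann distribution over ENERGIES at inverse temperature beta."""
    weights = [math.exp(-beta * e) for e in ENERGIES]
    z = sum(weights)
    return [w / z for w in weights]

def mean_energy(beta):
    return sum(p * e for p, e in zip(boltzmann(beta), ENERGIES))

def fisher_information(beta):
    """Variance of the energy, i.e. the second derivative of ln zeta(beta)."""
    m = mean_energy(beta)
    return sum(p * (e - m) ** 2 for p, e in zip(boltzmann(beta), ENERGIES))

def heat_capacity(temp, dt=1e-5):
    """C(T) = dE/dT, approximated by a central finite difference."""
    return (mean_energy(1.0 / (temp + dt)) - mean_energy(1.0 / (temp - dt))) / (2.0 * dt)

def kl_divergence(beta_from, beta_to):
    """D(P_from || P_to) between two Boltzmann distributions."""
    p, q = boltzmann(beta_from), boltzmann(beta_to)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def i1_extreme(beta1, beta2, steps=20000):
    """Midpoint-rule value of int_{T1}^{T2} (1/T1 - 1/T) C(T) dT with k = 1."""
    t1, t2 = 1.0 / beta1, 1.0 / beta2
    h = (t2 - t1) / steps
    total = 0.0
    for j in range(steps):
        t = t1 + (j + 0.5) * h
        total += (1.0 / t1 - 1.0 / t) * heat_capacity(t)
    return total * h

if __name__ == "__main__":
    beta1, beta2 = 1.0, 0.5  # beta1 > beta2, as in the hypotheses above
    print(kl_divergence(beta2, beta1), i1_extreme(beta1, beta2))
```

Here the Fisher information is evaluated at a single temperature, whereas the divergence accumulates the heat capacity over the whole interval $[T_1, T_2]$, which mirrors the local-versus-cumulative distinction made in the text.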
Figure 1: Entropy as function of energy and a graphical representation of error exponents.

4.4 Error Exponents of Time-Varying Scalar Quantizers

In this application example, we are back to the problem area of lossy data compression, but this
time, it is about scalar (symbol-by-symbol) compression. This setup is motivated by earlier results
about the optimality of time-shared scalar quantizers within the class of causal source codes
for memoryless sources, both under the average rate/distortion criteria [33] and large-deviations
performance criteria [29]. In particular, it was shown that under both criteria, optimum time-sharing
between at most two (entropy coded) scalar quantizers is as good as any causal source code
for memoryless sources. Here, we will focus on the large deviations performance criteria, namely,
source coding exponents.

Consider a time-varying scalar quantizer $\hat{X}_i = f_i(X_i)$, acting on a DMS $X_1, X_2, \ldots$, $X_i \in \mathcal{X}$,
drawn from $q$, where $\{f_i\}$ is an arbitrary (deterministic) sequence of quantizers from a given finite
set $\mathcal{F} = \{F_1, \ldots, F_S\}$, where $F_s: \mathcal{X} \to \hat{\mathcal{X}}_s$, $\hat{\mathcal{X}}_s$ being the reproduction alphabet corresponding to
$F_s$, $s = 1, \ldots, S$. In other words, for every $i = 1, 2, \ldots, n$, $f_i = F_{s_i}$, for a certain arbitrary sequence
of 'states', $s_1, s_2, \ldots$ (known to the decoder) with components in $\mathcal{S} = \{1, 2, \ldots, S\}$.

The distortion incurred by such a time-varying scalar quantizer, over $n$ units of time, is
$\sum_{i=1}^n d(X_i, f_i(X_i)) = \sum_{i=1}^n d(X_i, F_{s_i}(X_i))$. The total code length is $\sum_{i=1}^n L_{s_i}(F_{s_i}(X_i))$, where the
per-symbol length functions $L_s(\cdot)$ may correspond to either fixed-rate coding, where $L_s(\hat{x}) = R_s =
\lceil\log|\hat{\mathcal{X}}_s|\rceil$ for all $\hat{x}$, or any other length function satisfying the Kraft inequality, $\sum_{\hat{x}\in\hat{\mathcal{X}}_s} 2^{-L_s(\hat{x})} \le 1$.
For the sake of simplicity of the exposition, let us assume fixed-rate coding. We will denote by $n_s$,
$s \in \mathcal{S}$, the number of times that $s_i = s$ occurs in $s^n$, and $p(s) = n_s/n$ is the corresponding relative
frequency.
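The quantities just defined (total distortion, total fixed-rate code length, and the relative frequencies $p(s)$) can be illustrated on a toy instance. In the sketch below, the source alphabet $\{0,\ldots,7\}$, the two uniform quantizers, the squared-error distortion measure, and all names are hypothetical choices made for illustration; none of them come from the paper.

```python
import math
from collections import Counter

SOURCE_ALPHABET = range(8)  # hypothetical source alphabet {0, ..., 7}

# Hypothetical quantizers F_s: a coarse 2-level and a finer 4-level uniform quantizer.
QUANTIZERS = {
    1: lambda x: 4 * (x // 4) + 1.5,
    2: lambda x: 2 * (x // 2) + 0.5,
}

def fixed_rate(s):
    """R_s = ceil(log2 |reproduction alphabet of F_s|), in bits per symbol."""
    levels = {QUANTIZERS[s](x) for x in SOURCE_ALPHABET}
    return math.ceil(math.log2(len(levels)))

def distortion(x, x_hat):
    """Squared-error distortion (an assumption; any d(x, x_hat) would do)."""
    return (x - x_hat) ** 2

def evaluate(xs, states):
    """Return (total distortion, total code length, relative frequencies p(s))."""
    total_distortion = sum(distortion(x, QUANTIZERS[s](x)) for x, s in zip(xs, states))
    total_length = sum(fixed_rate(s) for s in states)
    counts = Counter(states)
    freqs = {s: counts[s] / len(states) for s in counts}
    return total_distortion, total_length, freqs

if __name__ == "__main__":
    # Use the coarse quantizer on the first half and the fine one on the second half.
    print(evaluate(list(range(8)), [1, 1, 1, 1, 2, 2, 2, 2]))
```

The state sequence is passed in explicitly because, as in the text, it is deterministic and known to the decoder; only its composition $\{n_s\}$ affects the code length under fixed-rate coding.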

In [29], among other results, the rate function of the excess distortion event

$$\sum_{i=1}^n d(X_i, F_{s_i}(X_i)) \ge nD, \qquad D > \sum_{(x,s)\in\mathcal{X}\times\mathcal{S}} q(x)p(s)d(x, F_s(x)),$$

was optimized across the class of all time-varying scalar quantizers (each one corresponding to a
different sequence $s_1, \ldots, s_n$) subject to a code-length constraint $\sum_{i=1}^n R_{s_i} \le nR$, or equivalently,
$\sum_{s\in\mathcal{S}} n_sR_s \le nR$, for a given pair $(D, R)$.

In the notation of our generic model, here we have $U_i = X_i$, $v_i = s_i$, $i = 1, \ldots, n$, $q(u|v) =
q(x|s) = q(x)$ independently of $s$, $f(u,v) = f(x,s) = -d(x, F_s(x))$, and $E = -D$,* and the excess
distortion exponent is of the same form as before (see also [29]). Here, however, unlike the previous
application examples, we have a degree of freedom to select the relative frequency of usage, $p(s)$, of
each member of $\mathcal{F}$, i.e., the time-sharing protocol, but we also have the constraint $\sum_s p(s)R_s \le R$.

From the statistical physics point of view, these additional ingredients mean that we have a
freedom to select the number of particles in each subsystem (though the total number, $n$, is still
fixed), and the additional constraint, $\sum_s p(s)R_s \le R$ (which is actually equivalent to the equality
constraint $\sum_s p(s)R_s = R$ in the interesting region of $(R, D)$ pairs), can be viewed as an additional
conservation law with respect to some other constant of motion, in addition to the energy (e.g., the
momentum), where in subsystem $s$, the (average) value of the corresponding physical quantity per
particle is $R_s$.

While in [29], we have considered the problem of maximizing the rate function (the source coding
exponent) of the excess distortion event $\sum_{i=1}^n d(X_i, F_{s_i}(X_i)) \ge nD$, a related objective (although
somewhat less well motivated, but still interesting) is to minimize the rate function (or maximize
the probability) of the small distortion event

$$\sum_{i=1}^n d(X_i, F_{s_i}(X_i)) \le nD, \qquad D < \sum_{(x,s)\in\mathcal{X}\times\mathcal{S}} q(x)p(s)d(x, F_s(x)).$$

In this case, the optimum performance is given by

$$F(R,D) = \max_{P\in\mathcal{P}(R)}\,\min_{\beta\ge 0}\Big[\beta D + \sum_{s=1}^{S} p(s)\ln\Big(\sum_{x\in\mathcal{X}} q(x)e^{-\beta d(x,F_s(x))}\Big)\Big],$$

where $\mathcal{P}(R)$ is the class of all probability distributions $P = \{p(s), s \in \mathcal{S}\}$ with $\sum_s p(s)R_s \le R$.
From the viewpoint of statistical physics, this corresponds to a situation where the various subsystems
are allowed to interact, not only thermally, but also chemically, i.e., an exchange of particles
is enabled in addition to the exchange of energy, and the maximization over $\mathcal{P}(R)$ (maximum entropy)
is achieved when the chemical potentials of the various subsystems reach a balance. As the
maximization over $P \in \mathcal{P}(R)$, i.e., subject to the constraint $\sum_s p(s)R_s \le R$, for a given $\beta$, is a linear
programming problem with one constraint (in addition to $\sum_s p(s) = 1$), then as was shown in [29],
for each distortion level (or energy) $D$, the optimum $P \in \mathcal{P}(R)$ may be non-zero for at most two
members of $\mathcal{S}$ only, which means that at most two subsystems are populated by particles in thermal
and chemical equilibrium under the two conservation laws (of $D$ and of $R$). However, the choice
of these two members of $\mathcal{S}$ depends, in general, on $D$, which in turn depends on the temperature.
Thus, when the system is heated gradually, certain phase transitions may occur, whenever there is
a change in the choice of the two populated subsystems.

* One may prefer to redefine $f(x,s) = D_{\max} - d(x, F_s(x))$ and $E = D_{\max} - D$, where $D_{\max} = \max_{x,s} d(x, F_s(x))$,
in order to work with non-negative quantities.
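The max-min structure of the small-distortion exponent can be explored numerically. The sketch below is a brute-force illustration, not the method of [29]: it searches over time-sharing between pairs of quantizers (relying on the at-most-two property quoted above), uses a grid over the mixing weight, and minimizes over $\beta$ by ternary search on a bounded interval. The toy distortion tables, the cap `beta_max`, and all function names are assumptions.

```python
import math

def log_mgf(beta, q, dist):
    """ln sum_x q(x) exp(-beta d(x, F_s(x))) for one quantizer."""
    return math.log(sum(qx * math.exp(-beta * dx) for qx, dx in zip(q, dist)))

def chernoff_exponent(D, weights, q, dists, beta_max=50.0, iters=120):
    """min over beta in [0, beta_max] of beta*D + sum_s p(s) ln M_s(beta).

    The objective is convex in beta, so a ternary search suffices.
    """
    def g(beta):
        return beta * D + sum(w * log_mgf(beta, q, d) for w, d in zip(weights, dists))
    lo, hi = 0.0, beta_max
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))

def best_time_sharing(R, D, q, dists, rates, grid=100):
    """Grid search over time-sharing between at most two quantizers, subject to the rate constraint."""
    best = -math.inf
    n = len(rates)
    for s in range(n):
        for t in range(s, n):
            for j in range(grid + 1):
                lam = j / grid
                if lam * rates[s] + (1.0 - lam) * rates[t] > R + 1e-12:
                    continue  # violates sum_s p(s) R_s <= R
                weights = [0.0] * n
                weights[s] += lam
                weights[t] += 1.0 - lam
                best = max(best, chernoff_exponent(D, weights, q, dists))
    return best

if __name__ == "__main__":
    q = [1.0 / 8] * 8                                   # uniform toy source
    d1 = [2.25, 0.25, 0.25, 2.25, 2.25, 0.25, 0.25, 2.25]  # coarse quantizer, 1 bit
    d2 = [0.25] * 8                                     # finer quantizer, 2 bits
    for R in (1.0, 1.5, 2.0):
        print(R, best_time_sharing(R, 0.5, q, [d1, d2], [1, 2]))
```

Sweeping $D$ for a fixed $R$ in such a sketch, with more than two candidate quantizers, is one way to observe a change in the identity of the two active quantizers, which is the phase-transition behavior described above.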

Finally, referring to comment no. 1 of Section 3, we should point out that here, in contrast
to our discussion thus far, the difference between the ensemble of distinguishable particles and
indistinguishable particles becomes critical since the factors $\{n_s!\}$ are no longer constant. Had
we assumed indistinguishability, the normalized log-partition function would no longer be affine
in $P$, thus the maximization over $P$ would no longer be a linear programming problem, and the
conclusion might have been different. In the source coding problem, the indistinguishable case
corresponds to a situation where the sequence of states $s^n$ is chosen uniformly at random (with
the decoder being informed of the result of the random selection, of course). In this case, the
Chernoff bound corresponding to each composition $\{n_s, s \in \mathcal{S}\}$ of $s^n$ should be weighed by the
probability of this composition, which is $S^{-n}\,n!/\prod_{s} n_s!$. Now, each factor of $1/n_s!$ can be absorbed
in the corresponding partition function $Z_s(\beta)$ of subsystem $s$, with the interpretation that in each
subsystem the particles are now indistinguishable. The maximum over $P$ would now correspond to
the dominant contribution in this weighted average of Chernoff bounds. One can, of course, extend
the discussion to any i.i.d. distribution on $s^n$, thus introducing additional bias and preferring some
compositions over others.

Appendix 

A.1. Sketch of an Alternative Proof of Theorem 1 via Chernoff Bounds

In this subsection, we outline another proof of Theorem 1 using a large deviations analysis approach.
In particular, consider the large deviations event $\sum_{i=1}^n f(U_i, v_i) \le nE$, as described in Section 2.
Assuming that the relative frequencies $\{p(v)\}$ all stabilize as $n \to \infty$, let us compute the rate
function $I(E)$ of the probability of this event in two different methods, where one would yield the
left-hand side of (6) and the other would give the right-hand side of (6).

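In the first method, we partition the sequence according to its different letters. Specifically,
let

$$\hat{E}_v = \frac{1}{n_v}\sum_{i:\,v_i=v} f(U_i, v), \qquad v \in \mathcal{V},$$

where $n_v$ is the number of occurrences of the symbol $v \in \mathcal{V}$ along $v^n$. Let $\mathcal{G}$ denote the set of
all possible vector values that can be taken on by the vector $\hat{\mathbf{E}} = \{\hat{E}_v, v \in \mathcal{V}\}$. Now, obviously,
$\sum_{i=1}^n f(U_i, v_i) \le nE$ if and only if there exists a vector $\mathbf{E} = \{E_v, v \in \mathcal{V}\} \in \mathcal{G}$ such that $\hat{E}_v \le E_v$
for all $v \in \mathcal{V}$ and $\sum_{v\in\mathcal{V}} p(v)E_v \le E$. The "if" part follows from

$$\sum_{i=1}^n f(U_i, v_i) = n\sum_{v\in\mathcal{V}} p(v)\hat{E}_v \le n\sum_{v\in\mathcal{V}} p(v)E_v \le nE.$$

The "only if" part follows by setting $E_v = \hat{E}_v$ for all $v \in \mathcal{V}$. Therefore, denoting $\mathcal{H}_G(E) =
\mathcal{H}_0(E) \cap \mathcal{G}$ (where $\mathcal{H}_0(E)$ is defined as in Section 2), we have:

$$
\begin{aligned}
\Pr\Big\{\sum_{i=1}^n f(U_i, v_i) \le nE\Big\} &= \Pr\bigcup_{\mathbf{E}\in\mathcal{H}_G(E)}\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v,\ \forall v \in \mathcal{V}\Big\} \\
&\le |\mathcal{H}_G(E)|\cdot\max_{\mathbf{E}\in\mathcal{H}_G(E)}\prod_{v\in\mathcal{V}}\Pr\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v\Big\} \\
&\le |\mathcal{G}|\cdot\max_{\mathbf{E}\in\mathcal{H}_G(E)}\prod_{v\in\mathcal{V}}\Pr\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v\Big\},
\end{aligned}
\tag{A.1}
$$

and on the other hand,

$$
\begin{aligned}
\Pr\Big\{\sum_{i=1}^n f(U_i, v_i) \le nE\Big\} &= \Pr\bigcup_{\mathbf{E}\in\mathcal{H}_G(E)}\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v,\ \forall v \in \mathcal{V}\Big\} \\
&\ge \max_{\mathbf{E}\in\mathcal{H}_G(E)}\Pr\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v,\ \forall v \in \mathcal{V}\Big\} \\
&= \max_{\mathbf{E}\in\mathcal{H}_G(E)}\prod_{v\in\mathcal{V}}\Pr\Big\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v\Big\}.
\end{aligned}
\tag{A.2}
$$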



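At this point, the only gap between the upper bound (A.1) and the lower bound (A.2) is the factor
$|\mathcal{G}|$. The number of different values that $\hat{E}_v$ can take does not exceed the number of different type
classes of sequences of length $n_v$ over the alphabet $\mathcal{U}$, which is upper bounded by $(n_v + 1)^{|\mathcal{U}|-1}$.
Thus,

$$
\begin{aligned}
|\mathcal{G}| &\le \prod_{v\in\mathcal{V}}(n_v+1)^{|\mathcal{U}|-1} \\
&= \exp\Big\{(|\mathcal{U}|-1)\sum_{v\in\mathcal{V}}\log(n_v+1)\Big\} \\
&= \exp\Big\{|\mathcal{V}|\cdot(|\mathcal{U}|-1)\cdot\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\log(n_v+1)\Big\} \\
&\le \exp\Big\{|\mathcal{V}|\cdot(|\mathcal{U}|-1)\log\Big(\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}[n_v+1]\Big)\Big\} \\
&= \exp\Big\{|\mathcal{V}|\cdot(|\mathcal{U}|-1)\log\Big(\frac{n}{|\mathcal{V}|}+1\Big)\Big\} \\
&= \Big(\frac{n}{|\mathcal{V}|}+1\Big)^{|\mathcal{V}|(|\mathcal{U}|-1)},
\end{aligned}
\tag{A.3}
$$

and therefore $|\mathcal{G}|$ is only polynomial in $n$, and hence does not affect the exponential behavior.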
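Now, each one of the terms $\Pr\{\sum_{i:\,v_i=v} f(U_i, v) \le n_vE_v\}$ is bounded exponentially tightly by an
individual Chernoff bound,

$$\exp\Big\{n_v\min_{\beta\ge 0}\Big[\beta E_v + \ln\Big(\sum_{u\in\mathcal{U}} q(u|v)e^{-\beta f(u,v)}\Big)\Big]\Big\},$$

and so, the dominant term of their product is of the exponential order of

$$\exp\Big\{n\max_{\mathbf{E}\in\mathcal{H}_G(E)}\sum_{v\in\mathcal{V}} p(v)\cdot\min_{\beta\ge 0}\Big[\beta E_v + \ln\Big(\sum_{u\in\mathcal{U}} q(u|v)e^{-\beta f(u,v)}\Big)\Big]\Big\}
= \exp\Big\{n\max_{\mathbf{E}\in\mathcal{H}_G(E)}\sum_{v\in\mathcal{V}} p(v)S_v(E_v)\Big\}.$$

Finally, as $n_v \to \infty$, the set $\mathcal{H}_G(E)$ becomes dense in the continuous set $\mathcal{H}_0(E)$, and by simple
continuity arguments, the maximum over $\mathcal{H}_G(E)$ tends to the maximum over $\mathcal{H}_0(E)$.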

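The other method to evaluate the rate function $I(E)$ is as follows. Let $\ell$ be a fixed positive
integer that divides $n$, and denote $\ell_v = \ell p(v)$, $v \in \mathcal{V}$ (assume that $\ell$ is chosen large enough that
$\ell p(v)$ is well approximated by the closest integer with a very small relative error). Now, re-order the
pairs $\{(U_i, v_i)\}$ (periodically), according to the following rule: Assuming, without loss of generality,
that $\mathcal{V} = \{1, 2, \ldots, |\mathcal{V}|\}$, the first $\ell_1 = \ell p(1)$ symbol pairs of each $\ell$-block of $(u^n, v^n)$ are such that
$v = 1$, the next $\ell_2 = \ell p(2)$ symbol pairs of each $\ell$-block are such that $v = 2$, and so on. In other
words, each $\ell$-block, $v_{(i-1)\ell+1}^{i\ell} = (v_{(i-1)\ell+1}, v_{(i-1)\ell+2}, \ldots, v_{i\ell})$, $i = 1, 2, \ldots, n/\ell$, consists of the same
relative frequencies $\{p(v)\}$ as the entire sequence, $v^n$. Now, for the re-ordered sequence of pairs,
let us define $X_i = \sum_{t=(i-1)\ell+1}^{i\ell} f(U_t, v_t)$, $i = 1, 2, \ldots, n/\ell$. Obviously, $X_1, X_2, \ldots, X_{n/\ell}$ are i.i.d.
and therefore the probability of the large deviations event $\{\sum_{i=1}^{n/\ell} X_i \le \frac{n}{\ell}\cdot\ell E\}$ can be assessed
exponentially tightly by the Chernoff bound as follows:

$$
\begin{aligned}
&\exp\Big\{\frac{n}{\ell}\cdot\min_{\beta\ge 0}\Big[\beta\cdot\ell E + \ln\Big(\sum_{u^\ell} q(u^\ell|v^\ell)\exp\Big\{-\beta\sum_{i=1}^{\ell} f(u_i, v_i)\Big\}\Big)\Big]\Big\} \\
&= \exp\Big\{\frac{n}{\ell}\cdot\min_{\beta\ge 0}\Big[\beta\cdot\ell E + \ln\Big(\prod_{v\in\mathcal{V}}\sum_{u^{\ell_v}} q(u^{\ell_v}|v)\exp\Big\{-\beta\sum_{i=1}^{\ell_v} f(u_i, v)\Big\}\Big)\Big]\Big\} \\
&= \exp\Big\{\frac{n}{\ell}\cdot\min_{\beta\ge 0}\Big[\beta\cdot\ell E + \ln\prod_{v\in\mathcal{V}}\Big(\sum_{u\in\mathcal{U}} q(u|v)e^{-\beta f(u,v)}\Big)^{\ell_v}\Big]\Big\} \\
&= \exp\Big\{\frac{n}{\ell}\cdot\min_{\beta\ge 0}\Big[\beta\cdot\ell E + \ell\cdot\sum_{v\in\mathcal{V}} p(v)\ln\Big(\sum_{u\in\mathcal{U}} q(u|v)e^{-\beta f(u,v)}\Big)\Big]\Big\} \\
&= \exp\Big\{n\cdot\min_{\beta\ge 0}\Big[\beta E + \sum_{v\in\mathcal{V}} p(v)\ln\Big(\sum_{u\in\mathcal{U}} q(u|v)e^{-\beta f(u,v)}\Big)\Big]\Big\} \\
&= e^{nS(E)}.
\end{aligned}
\tag{A.4}
$$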



Since both approaches yield exponentially tight evaluations of I{E), they must be equal. 
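The equality of the two evaluations can also be observed numerically on a small instance. The sketch below compares the single-temperature expression (one minimization over $\beta$) with the best split of the energy budget between two subsystems (a grid search over allocations on the boundary $\sum_v p(v)E_v = E$). The two-subsystem example, the search interval for $\beta$, the grid size, and all function names are assumptions chosen for illustration.

```python
import math

def log_z(beta, q_v, f_v):
    """ln Z_v(beta) = ln sum_u q(u|v) exp(-beta f(u, v))."""
    return math.log(sum(q * math.exp(-beta * f) for q, f in zip(q_v, f_v)))

def mean_f(beta, q_v, f_v):
    """Average of f(U, v) under the tilted (Boltzmann) distribution at beta."""
    w = [q * math.exp(-beta * f) for q, f in zip(q_v, f_v)]
    return sum(wi * fi for wi, fi in zip(w, f_v)) / sum(w)

def minimize_convex(g, lo=0.0, hi=60.0, iters=100):
    """Ternary search for the minimum of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    beta = 0.5 * (lo + hi)
    return beta, g(beta)

def joint_side(E, p, q, f):
    """min_beta [beta E + sum_v p(v) ln Z_v(beta)]; returns (beta*, value)."""
    return minimize_convex(
        lambda b: b * E + sum(pv * log_z(b, qv, fv) for pv, qv, fv in zip(p, q, f)))

def s_v(E_v, q_v, f_v):
    """S_v(E_v) = min_beta [beta E_v + ln Z_v(beta)] for a single subsystem."""
    return minimize_convex(lambda b: b * E_v + log_z(b, q_v, f_v))[1]

def split_side(E, p, q, f, grid=800):
    """Grid search over allocations (two subsystems) with p0*E0 + p1*E1 = E."""
    lo, hi = min(f[0]), max(f[0])
    best = -math.inf
    for j in range(grid + 1):
        e0 = lo + (hi - lo) * j / grid
        e1 = (E - p[0] * e0) / p[1]
        best = max(best, p[0] * s_v(e0, q[0], f[0]) + p[1] * s_v(e1, q[1], f[1]))
    return best

# Hypothetical instance: two subsystems, three states each.
P = (0.4, 0.6)
Q = ((1 / 3, 1 / 3, 1 / 3), (1 / 3, 1 / 3, 1 / 3))
F = ((0.0, 1.0, 2.0), (0.0, 2.0, 4.0))

if __name__ == "__main__":
    beta_star, joint = joint_side(1.0, P, Q, F)
    print(beta_star, joint, split_side(1.0, P, Q, F))
```

In this sketch, the maximizing allocation coincides (up to grid resolution) with the energies that each subsystem would have at the common inverse temperature returned by `joint_side`, which is the thermal-equilibrium interpretation of the identity.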
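A.2. A More Rigorous Derivation of Eq. (16)

The exact derivation of eq. (16) for the finite interval integration, is as follows:

$$
\begin{aligned}
-\frac{\partial}{\partial\beta}\ln\Big[\int_{-A}^{A}\exp\{-\beta\epsilon_0|x-\hat{x}|^\theta\}\,\mathrm{d}\hat{x}\Big]
&= -\frac{\partial}{\partial\beta}\ln\Big[\beta^{-1/\theta}\int_{-A}^{A}\exp\{-\epsilon_0|\beta^{1/\theta}(\hat{x}-x)|^\theta\}\,\mathrm{d}(\beta^{1/\theta}(\hat{x}-x))\Big] \\
&= -\frac{\partial}{\partial\beta}\ln\Big[\beta^{-1/\theta}\int_{-\beta^{1/\theta}(A+x)}^{\beta^{1/\theta}(A-x)}\exp\{-\epsilon_0|z|^\theta\}\,\mathrm{d}z\Big] \\
&= \frac{1}{\beta\theta} - \frac{\partial}{\partial\beta}\ln\Big[\int_{-\beta^{1/\theta}(A+x)}^{\beta^{1/\theta}(A-x)}\exp\{-\epsilon_0|z|^\theta\}\,\mathrm{d}z\Big] \\
&= \frac{1}{\beta\theta}\Bigg\{1 - \frac{\beta^{1/\theta}\big[(A-x)\exp\{-\beta\epsilon_0|A-x|^\theta\} + (A+x)\exp\{-\beta\epsilon_0|A+x|^\theta\}\big]}{\int_{-\beta^{1/\theta}(A+x)}^{\beta^{1/\theta}(A-x)}\exp\{-\epsilon_0|z|^\theta\}\,\mathrm{d}z}\Bigg\}.
\end{aligned}
\tag{A.5}
$$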



When $\beta$ is very large, the denominator of the second term of the expression in the curly brackets of
the right-most side goes to $\int_{-\infty}^{+\infty}\exp\{-\epsilon_0|z|^\theta\}\,\mathrm{d}z$, which is a constant. Now if, in addition, $|x| < A$,
then the numerator tends to zero as $\beta$ grows without bound. Thus, the dominant term, for low
temperatures, is $1/(\beta\theta) = kT/\theta$.
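The low-temperature behavior can be illustrated by evaluating the average energy directly as a ratio of two integrals, which equals $-\partial\ln Z/\partial\beta$. The sketch below uses a plain midpoint rule; the parameter values ($A = 1$, $\epsilon_0 = 1$, the chosen $x$, $\theta$ and $\beta$) and the function name are assumptions made for illustration. For $\theta = 1$ and $x = 0$ the two integrals can be done by hand, giving $1/\beta - \epsilon_0A/(e^{\beta\epsilon_0A} - 1)$, which serves as a reference value here.

```python
import math

def mean_energy(beta, x=0.0, A=1.0, theta=2.0, eps0=1.0, steps=200000):
    """Average of eps0*|x - xh|^theta under the Boltzmann weight on [-A, A].

    This ratio of integrals equals -d/d(beta) ln int_{-A}^{A} exp(-beta*eps0*|x - xh|^theta) d(xh).
    Both integrals are approximated by the midpoint rule.
    """
    h = 2.0 * A / steps
    num = 0.0
    den = 0.0
    for j in range(steps):
        xh = -A + (j + 0.5) * h
        energy = eps0 * abs(x - xh) ** theta
        weight = math.exp(-beta * energy)
        num += energy * weight
        den += weight
    return num / den

if __name__ == "__main__":
    beta = 200.0  # low temperature: the boundary terms are negligible
    print(mean_energy(beta, x=0.3, theta=2.0), 1.0 / (beta * 2.0))
    # Moderate temperature with theta = 1, x = 0: compare with the hand-computed value.
    b = 2.0
    print(mean_energy(b, x=0.0, theta=1.0), 1.0 / b - 1.0 / (math.exp(b) - 1.0))
```

At large $\beta$ the numerical value approaches $1/(\beta\theta)$ regardless of where $x$ sits inside $(-A, A)$, while at moderate $\beta$ the boundary correction is clearly visible.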

An exact closed-form expression, for every finite $\beta$, can be derived for the case $\theta = 1$, since in
this case, the integral at the denominator has a simple expression. For example, setting $\theta = 1$, and
$x = 0$ in the above expression, yields:

$$\frac{1}{\beta} - \frac{\epsilon_0A}{e^{\beta\epsilon_0A} - 1} = kT - \frac{\epsilon_0A}{e^{\epsilon_0A/(kT)} - 1}.$$

Note that this expression is valid only in the range where it is monotonically increasing in $T$.
(Beyond this point, the minimizing $\beta$ is no longer the point of zero derivative.)